Comparing k-means clusters on parallel Persian-English corpus
Abstract
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing, and a great deal of research on clustering English documents has already been carried out. This raises the question of whether those results extend to other languages. Since the goal of document clustering is to group documents by their content, the answer is expected to be yes. On the other hand, the many differences between languages could make the answer no. This research focuses on k-means, one of the basic and popular document clustering methods. We want to know whether the clusters of aligned Persian and English texts obtained by k-means are similar. To answer this question, the Mizan English-Persian parallel corpus was used as a benchmark. After feature extraction using text mining techniques and dimension reduction with PCA, k-means clustering was performed. The morphological differences between English and Persian led to a longer feature vector for Persian, so in almost all experiments the English results were slightly richer than the Persian ones. Aside from these differences, the overall behavior of the Persian and English clusters was similar. This similar behavior showed that the results of k-means research on English can be extended to Persian. Finally, there is hope that, despite the many differences between languages, clustering methods may be extendable to other languages.
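As a rough illustration of the question posed in the abstract, the following minimal sketch clusters two small sets of aligned sentences with k-means and measures how far the two clusterings agree. Everything in it is an assumption made for illustration: the toy sentences are hypothetical and not taken from the Mizan corpus, the features are plain term counts, the PCA step is omitted for brevity, the centroids are seeded with the first k documents to keep the run deterministic, and the Rand index is only one possible agreement measure (the abstract does not say which measure the paper used).

```python
from collections import Counter
from itertools import combinations

# Hypothetical aligned toy sentences (NOT from the Mizan corpus).
english_docs = [
    "cat drank milk",
    "market prices fell",
    "cat chased mouse",
    "market prices rose",
]
persian_docs = [
    "گربه شیر نوشید",
    "قیمت بازار کاهش یافت",
    "گربه موش را دنبال کرد",
    "قیمت بازار افزایش یافت",
]


def bag_of_words(docs):
    """Turn whitespace-tokenized documents into term-count vectors."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([float(counts[t]) for t in vocab])
    return vectors


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids
    (an assumption made here for determinism, not the paper's setup)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree,
    i.e. both put the pair together or both keep it apart."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    en_labels = kmeans(bag_of_words(english_docs), k=2)
    fa_labels = kmeans(bag_of_words(persian_docs), k=2)
    print(en_labels, fa_labels, rand_index(en_labels, fa_labels))
```

Because the sentences are aligned, document i in one language corresponds to document i in the other, so the two label lists can be compared pair by pair. The Rand index does not depend on how the clusters are numbered, which matters because k-means assigns cluster ids arbitrarily in each language.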
Similar resources
MIZAN: A Large Persian-English Parallel Corpus
One of the most essential tasks in natural language processing is machine translation, which is now highly dependent on multilingual parallel corpora. In this paper, we introduce the biggest Persian-English parallel corpus, with more than one million sentence pairs collected from masterpieces of literature. We also present the acquisition process and statistics of the corpus, and expe...
TEP: Tehran English-Persian Parallel Corpus
Parallel corpora are one of the key resources in natural language processing. In spite of their importance in many multilingual applications, no large-scale English-Persian corpus has been made available so far, given the difficulties in its creation and the intensive labor required. In this paper, the construction process of Tehran English-Persian parallel corpus (TEP) using movie subtitles,...
PEN: Parallel English-Persian News Corpus
Parallel corpora are necessary resources in many multilingual natural language processing applications, including machine translation and cross-lingual information retrieval. Manual preparation of a large-scale parallel corpus is a very time-consuming and costly procedure. In this paper, the work towards building a sentence-level aligned English-Persian corpus in a semi-automated manner is p...
Extracting an English-Persian Parallel Corpus from Comparable Corpora
Parallel data are an important part of a reliable Statistical Machine Translation (SMT) system. The more of these data are available, the better the quality of the SMT system. However, for some language pairs such as Persian-English, parallel sources of this kind are scarce. In this paper, a bidirectional method is proposed to extract parallel sentences from English and Persian document aligned...
Creating a Persian-English Comparable Corpus
Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. However the Persian language does not have rich multilingual resources due to some of its special features and difficulties in constructing the corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in Englis...
Journal title: Journal of AI and Data Mining
Publisher: Shahrood University of Technology
ISSN 2322-5211
Volume 3
Issue 2, 2015